Use this template to complete your project throughout the course. Your Final Project presentation will be based on the contents of this document. Replace the title/name above and text below with your own, but keep the headers. Feel free to change the theme and other display settings, but this is not required.
#Give a brief a description of your project and its goal(s), what data you are using to complete it, and what two faculty/staff in different fields you have spoken to about your project with a brief summary of what you learned from each person. Include a link to your final project GitHub repository. Social determinants of health, such as inadequate housing, lack of access to reliable transportation, and insufficient food access, have increasingly been recognized as being factors that adversely impact health outcomes (https://health.gov/healthypeople/priority-areas/social-determinants-health). In this project, with guidance from advisors John Holmes, PhD and Sameh Saleh, MD, I will attempt to determine if there is an observed association between low food access and type 2 diabetes prevalence in the United States for the most recent year for which data about both were available, 2015. It is possible that there is a positive association between decreased food access and type 2 diabetes prevalence, as individuals in low food access areas may have little choice but to rely on convenience foods and fast foods for their nutritional intake, which is likely to increase the risk of developing type 2 diabetes. I will use the AHRQ Social Determinants of Health dataset for this work.
Link to final project Github repository: https://github.com/mcgreevj/BMIN503_Final_Project
#Describe the problem addressed, its significance, and some background to motivate the problem.
#Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff. Type 2 diabetes is a major health problem in the United States, with significant morbidity and mortality for the at least 33 million people who suffer with this condition. Additionally, the economic costs of type 2 diabetes in the aggregate are substantial, totaling $327 billion in 2017 (The Cost of Diabetes | ADA). Tragically, type 2 diabetes is largely a preventable disease that has modifiable risk factors (weight, diet, exercise). Increasingly, there has been recognition that social determinants of health can serve as barriers that prevent individuals from being able to successfully modify risk factors. For example, individuals who lack access to insurance and/or reliable transportation may not be able to seek routine medical care for education, prevention, monitoring, and treatment purposes. As noted, diet is a major factor in the development of type 2 diabetes. And while ideal diets to prevent type 2 diabetes have been advised, not every American necessarily has access to food to be able to consume these recommended diets. This is the phenomenon of food insecurity. Food insecurity is formally defined by the US Department of Agriculture as: “the limited or uncertain availability of nutritionally adequate and safe foods, or limited or uncertain ability to acquire acceptable foods in socially acceptable ways.” (USDA ERS - Measurement). One factor that can lead to lack of access to healthy foods is the phenomenon of food deserts – areas that are low income and that have few or no supermarkets or other sources of healthy food options. As formally defined by the USDA, a food desert is: “A census tract that meets both low-income and low-access criteria including: 1. poverty rate is greater than or equal to 20 percent OR median family income does not exceed 80 percent statewide (rural/urban) or metro-area (urban) median family income; 2. at least 500 people or 33 percent of the population located more than 1 mile (urban) or 10 miles (rural) from the nearest supermarket or large grocery store.” In the absence of options for healthy food, individuals residing in these communities may need to resort to convenience and fast foods for some or many of their meals. In doing so, they increase the risk of developing type 2 diabetes.
This project is interdisciplinary in that it combines census tract, geographical data with health survey data. Notably the AHRQ SDOH database integrates data from multiple surveys and sources. The fields and topics that contribute to understanding of this issue are multiple: public health, clinical medicine, sociology, health disparities, economics, geography, history (in that the reasons many communities have limited access to food involves historical structural decisions, including forms of structural racism such as redlining). Moreover, the findings of this project have social justice, ethical, cultural, economic, and political implications. Based on collaboration with my mentors, I have learned the importance of familiarizing myself with data sources and data definitions, notably the Data Source Documentation guide from AHRQ (AHRQ Social Determinants of Health (SDOH) Database). This step is necessary from an exploratory standpoint to contemplate what kind of project I might do and also from a practical standpoint to understand how to properly and specifically answer my project question with the available data. I have also learned that some questions may be challenging to answer exactly as initially envisioned. For example, when considering this project at first, I had hoped to show changes over time in the association between limited food access and type 2 diabetes prevalence, perhaps including recent data from 2020 or 2021. However, after close inspection of this SDOH data set, I have realized that not all data is available for every year of interest. Thus I have needed to narrow the scope of my project a bit, given that data for limited food access and diabetes prevalence intersect in 2015, which is the most recent year that both types of data are available.
The identification of an independent association between low food access and type 2 diabetes prevalence would help support the case for policies and programs (both at the national and state levels) to improve food access in impacted census tracts. From a social justice standpoint, those with power in society (elected office holders, business and community leaders, members of the health care community including researchers) have a responsibility to seek ways to provide a known and effective preventive therapy (whether a vaccine or a grocery store) to those without such access currently, to prevent disease. There is also a national public health and cost dimension to this topic. People with diabetes suffer complications and morbidity related to diabetes, impairing their ability to support themselves, live independent lives, and contribute to a community’s economic activity and well-being. At the same time, the national cost of providing care for individuals with type 2 diabetes is borne by all Americans, in part because individuals living in low access food areas are more likely to have public (Medicaid, Medicare) insurance, if they have insurance at all, that is funded by all U.S. taxpayers. Indeed, Medicaid and Medicare combined provide health coverage to approximately 36% of Americans today and that percentage continues to rise over time. (Health Insurance Coverage of the Total Population | KFF). Data Note: Three Findings about Access to Care and Health Outcomes in Medicaid (kff.org). “Uncompensated” care for individuals with type 2 diabetes who lack any type of insurance is also a cost borne by individual health care organizations, governments, and ultimately, tax payers. Greater awareness of the relationships between SDOH and health outcomes, combined with a blunt accounting of the costs of inaction on this subject, will hopefully spur actions that can help mitigate SDOH and improve both individual and community health.
For this project, I used the AHRQ Social Determinants of Health database (available at https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html). I reviewed the data source documentation document (https://www.ahrq.gov/sites/default/files/wysiwyg/sdoh/SDOH-Data-Sources-Documentation-v1-Final.pdf) to learn about potential variables for this project and their availability. Ultimately, I planned to use two main variables available in the SDOH database: CHR_PCT_DM and FARA_TRACT_LILA_HALF_10.
CHR_PCT_DM is defined as the “percentage of adults with diagnosed diabetes (ages 20 and over).” Data for this variable is provided by the CDC Interactive Diabetes Atlas which in turn uses BRFSS data to estimate prevalence by U.S. county. Notably, starting in 2015, BRFSS included both landline and cellphone survey data to obtain diabetes prevalence data. Previously, this data had been collected exclusively via landline surveys.
FARA_PCT_LA_HALF_10_POP is defined as the “percentage of [healthy food] low access population measured at 1/2 mile for urban areas and 10 miles for rural areas.” For this variable, low access to healthy food is defined as being far from a supermarket, supercenter, or large grocery store, so more than 1/2 mile in urban areas and more than 10 miles in rural areas for this variable. #Does low access first require low-income designation for a tract, or is low access independent of low-income status for a tract? That is, a tract could be LA and not be LI?
This variable comes from the Food Access Research Atlas (FARA), created by the Economic Research Service (ERS) at the United States Department of Agriculture. A mapping tool is offered by the ERS and is available here: https://www.ers.usda.gov/data-products/food-access-research-atlas/go-to-the-atlas/.
In contrast to the diabetes prevalence data, FARA data is provided at the census tract level.
Of note, the following is the ERS methodology for calculating distance to the nearest grocery store, excerpted from the ERS definitions document available at https://www.ers.usda.gov/data-products/food-access-research-atlas/documentation/:
“First, the entire country is divided into ½-kilometer (about 1/3-mile)-square grids, and data on the population are aerially allocated to these grids (see Low-Income and Low-Foodstore-Access Census Tracts, 2015-19 in the link below). Distance to the nearest supermarket is measured for each grid cell by calculating the distance between the geographic center of the ½-kilometer square grid that contains estimates of the population (number of people and other subgroup characteristics) and the center of the grid with the nearest supermarket.”
While I had hoped to be able to include recent data for this project, the most recent year that data for both CHR_PCT_DM and FARA_PCT_LA_HALF_10_POP is available is 2015.
After having consulted with my advisors (Dr. Holmes and Dr. Saleh) and choosing the key variables above, I loaded the appropriate datasets into R.
For CHR_PCT_DIABETES, the 2015 data (by county) was available as an Excel file here: https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fwww.ahrq.gov%2Fsites%2Fdefault%2Ffiles%2Fwysiwyg%2Fsdoh%2FSDOH_2015_COUNTY_1_0.xlsx&wdOrigin=BROWSELINK.
For FARA_PCT_LA_HALF_10_POP, the 2015 data (by tract) was available as an Excel file here: https://www.ahrq.gov/downloads/sdoh/sdoh_2015_tract_1_0.xlsx.
Once loaded into R, I manipulated each of these raw datasets so as to limit them to the variables of interest.
For the tract dataset, I mutated it to create a new variable, MEAN_PCT_LA_COUNTY, that reported the mean low access percentage by county, using group by and summarize functions.
I joined the dataframes containing CHR_PCT_DIABETES and FARA_PCT_LA_HALF_10_POP, respectively, by COUNTYFIPS, which was present in both dataframes.
In preparation for an exploratory data analysis using choropleths, I loaded census county files using practicum code as a guide.
I created a basic U.S. choropleth of low access percentage by county. Similarly, I created abasic U.S. choropleth of diabetes prevalence by county.
I will plan to create a logistic regression model to determine if there is any relationship between low food access and diabetes prevalence. [Is there any way to graphically demonstrate a relationship]
I will consider adjusting the model for other potentially relevant variables such as income, age, obesity if possible.
Result of exploratory data analysis Result of logistic regression model Result after adjusting for other variables [preliminary code below]
Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.
Limitations There was not an evident way to capture purely type 2 diabetes data from available surveys. Interestingly, the CHR_PCT_DIABETES variable counts any diagnosis of diabetes, whether type 1, type 2, or other. In this project, the diabetes of interest is type 2 diabetes given that type 2 diabetes is associated with modifiable lifestyle factors. Still, given that 90-95% of all U.S. diabetes cases are type 2 (https://www.cdc.gov/diabetes/basics/type2.html#:~:text=Healthy%20eating%20is%20your%20recipe,adults%20are%20also%20developing%20it), this project is still capable of providing a reasonably strong estimate of any association between low access and type 2 diabetes prevalence.
Conclusion
[Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.]
library("readxl")#Read in the 2015 low access data, reported by census tract
Lowaccess_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.sdoh_2015_tract_1_0.xlsx")
dim(Lowaccess_raw)## [1] 74133 405
#Read in the 2015 diabetes prevalence data, reported by county
Diabetes_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.SDOH_2015_COUNTY_1_0.xlsx")## Warning: Expecting logical in RJ1671 / R1671C478: got '46123'
## Warning: Expecting logical in RJ1763 / R1763C478: got '32510'
## Warning: Expecting logical in RK1763 / R1763C479: got '41025'
## Warning: Expecting logical in RL1763 / R1763C480: got '41037'
## Warning: Expecting logical in RJ2797 / R2797C478: got '49017'
## Warning: Expecting logical in RK2797 / R2797C479: got '49019'
## Warning: Expecting logical in RL2797 / R2797C480: got '49025'
## Warning: Expecting logical in RM2797 / R2797C481: got '49055'
## Warning: Expecting logical in RJ2842 / R2842C478: got '51760'
dim(Diabetes_raw)## [1] 3234 979
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(cowplot)
library(dplyr)
#Limiting Lowaccess_raw to include variables of most interest:
Lowaccess<- dplyr::select(Lowaccess_raw, TRACTFIPS, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, FARA_TRACT_LILA_HALF_10, FARA_PCT_LA_HALF_10_POP, HRSA_MUA_CENSUS_TRACT, POS_DIST_ED_TRACT, CEN_AIAN_NH_IND)
str(Lowaccess) #74,133 x 25## tibble [74,133 × 25] (S3: tbl_df/tbl/data.frame)
## $ TRACTFIPS : chr [1:74133] "01001020100" "01001020200" "01001020300" "01001020400" ...
## $ COUNTYFIPS : chr [1:74133] "01001" "01001" "01001" "01001" ...
## $ STATEFIPS : chr [1:74133] "01" "01" "01" "01" ...
## $ STATE : chr [1:74133] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ COUNTY : chr [1:74133] "Autauga County" "Autauga County" "Autauga County" "Autauga County" ...
## $ REGION : chr [1:74133] "South" "South" "South" "South" ...
## $ ACS_TOT_POP_WT : num [1:74133] 1948 2156 2968 4423 10763 ...
## $ ACS_MEDIAN_AGE : num [1:74133] 38.8 35.1 40.2 39.7 34.2 32.8 36.3 43.5 37.1 37.5 ...
## $ ACS_PCT_EMPLOYED : num [1:74133] 94.6 86.7 93.8 89.2 95.8 ...
## $ ACS_MEDIAN_HH_INC : num [1:74133] 61838 32303 44922 54329 51965 ...
## $ ACS_PCT_HH_INC_10000 : num [1:74133] 2.3 10.29 8.68 1.05 4.15 ...
## $ ACS_PCT_HH_INC_100000 : num [1:74133] 19.66 13.58 9.46 17.31 23.53 ...
## $ ACS_PER_CAPITA_INC : num [1:74133] 25713 18021 20689 24125 27526 ...
## $ ACS_PCT_POV_AIAN : num [1:74133] 0 NA 0 0 NA NA NA 100 NA NA ...
## $ ACS_MEDIAN_HOME_VALUE : num [1:74133] 149100 97900 110600 134700 174300 ...
## $ ACS_PCT_MEDICAID_ANY : num [1:74133] 16.12 24.49 12.13 4.49 5.65 ...
## $ ACS_PCT_MEDICAID_ANY_BELOW64: num [1:74133] 17.52 25.38 12.7 4.94 6.35 ...
## $ ACS_PCT_MEDICARE_ONLY : num [1:74133] 4.83 3 3.41 3.45 2.2 4 3.48 5.19 3.98 3.99 ...
## $ ACS_PCT_PRIVATE_ANY : num [1:74133] 57.6 42.9 49.7 58.1 66.2 ...
## $ ACS_PCT_UNINSURED : num [1:74133] 11.55 14.08 11.18 15.77 5.89 ...
## $ FARA_TRACT_LILA_HALF_10 : num [1:74133] 0 0 0 0 0 0 1 0 0 0 ...
## $ FARA_PCT_LA_HALF_10_POP : num [1:74133] 90.6 65 82 83.2 72.2 ...
## $ HRSA_MUA_CENSUS_TRACT : chr [1:74133] "0" "0" "0" "0" ...
## $ POS_DIST_ED_TRACT : num [1:74133] 2.23 1.37 0.83 0.55 1.59 ...
## $ CEN_AIAN_NH_IND : chr [1:74133] "0" "0" "0" "0" ...
colnames(Lowaccess)## [1] "TRACTFIPS" "COUNTYFIPS"
## [3] "STATEFIPS" "STATE"
## [5] "COUNTY" "REGION"
## [7] "ACS_TOT_POP_WT" "ACS_MEDIAN_AGE"
## [9] "ACS_PCT_EMPLOYED" "ACS_MEDIAN_HH_INC"
## [11] "ACS_PCT_HH_INC_10000" "ACS_PCT_HH_INC_100000"
## [13] "ACS_PER_CAPITA_INC" "ACS_PCT_POV_AIAN"
## [15] "ACS_MEDIAN_HOME_VALUE" "ACS_PCT_MEDICAID_ANY"
## [17] "ACS_PCT_MEDICAID_ANY_BELOW64" "ACS_PCT_MEDICARE_ONLY"
## [19] "ACS_PCT_PRIVATE_ANY" "ACS_PCT_UNINSURED"
## [21] "FARA_TRACT_LILA_HALF_10" "FARA_PCT_LA_HALF_10_POP"
## [23] "HRSA_MUA_CENSUS_TRACT" "POS_DIST_ED_TRACT"
## [25] "CEN_AIAN_NH_IND"
#Lowaccess %>% dplyr::summarise(count(COUNTYFIPS))
#This gives a table of the number of tracts per COUNTYFIPS
#Trying a couple of things here to get an accurate county count for low access:
#1
#Lowaccess %>% dplyr::summarise(count(COUNTYFIPS)) %>% dplyr::summarise(sum(freq)) #74133 COUNTYFIPS codes
#did not run second time I ran it
#2
#Lowaccess %>% dplyr::summarise(count(COUNTY)) #1968 counties?
#did not run second time I ran it
#Limiting Diabetes_raw to include variables of most interest:
Diabetes<- dplyr::select(Diabetes_raw, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, CEN_AIAN_NH_IND, ACS_TOT_HH, ACS_AVG_HH_SIZE, ACS_PCT_CHILD_1FAM, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES)
str(Diabetes) #3234 x 25## tibble [3,234 × 25] (S3: tbl_df/tbl/data.frame)
## $ COUNTYFIPS : chr [1:3234] "01001" "01003" "01005" "01007" ...
## $ STATEFIPS : chr [1:3234] "01" "01" "01" "01" ...
## $ STATE : chr [1:3234] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ COUNTY : chr [1:3234] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ REGION : chr [1:3234] "South" "South" "South" "South" ...
## $ ACS_TOT_POP_WT : num [1:3234] 55221 195121 26932 22604 57710 ...
## $ ACS_MEDIAN_AGE : num [1:3234] 37.7 42.2 38.8 38.9 40.7 39.3 40.5 39.1 43 45.4 ...
## $ ACS_PCT_EMPLOYED : num [1:3234] 92.4 92.5 82.3 91.7 92.3 ...
## $ ACS_MEDIAN_HH_INC : num [1:3234] 51281 50254 32964 38678 45813 ...
## $ ACS_PCT_HH_INC_10000 : num [1:3234] 6.14 6.36 13.5 7.74 7.77 ...
## $ ACS_PCT_HH_INC_100000 : num [1:3234] 19.3 19.94 9.16 9.82 12.88 ...
## $ ACS_PER_CAPITA_INC : num [1:3234] 24974 27317 16824 18431 20532 ...
## $ ACS_PCT_POV_AIAN : num [1:3234] 52.2 9.15 0 4.65 6.6 ...
## $ ACS_MEDIAN_HOME_VALUE : num [1:3234] 141300 169300 92200 102700 119800 ...
## $ ACS_PCT_MEDICAID_ANY : num [1:3234] 12.3 12.3 25.2 18.4 15.4 ...
## $ ACS_PCT_MEDICAID_ANY_BELOW64: num [1:3234] 13.2 14.4 27.6 20.3 16.8 ...
## $ ACS_PCT_MEDICARE_ONLY : num [1:3234] 3.86 6.42 5.31 6.89 5.82 3.54 7.28 4.78 7.51 6.43 ...
## $ ACS_PCT_PRIVATE_ANY : num [1:3234] 59.2 59.1 43.6 56.1 58.4 ...
## $ ACS_PCT_UNINSURED : num [1:3234] 10.12 12.96 15.51 9.66 11.64 ...
## $ CEN_AIAN_NH_IND : chr [1:3234] "0" "0" "0" "0" ...
## $ ACS_TOT_HH : num [1:3234] 20396 74104 9222 7027 20816 ...
## $ ACS_AVG_HH_SIZE : num [1:3234] 2.68 2.6 2.61 2.95 2.74 2.73 2.49 2.52 2.44 2.28 ...
## $ ACS_PCT_CHILD_1FAM : num [1:3234] 22.3 22.5 53.4 26 25.2 ...
## $ CHR_PCT_ADULT_OBESITY : num [1:3234] 37.5 31 44.3 37.8 34.4 39.4 40.2 37.1 40.3 36.3 ...
## $ CHR_PCT_DIABETES : num [1:3234] 14.2 11.3 18 14.9 14.3 20 17.9 16.2 17 13.5 ...
#Diabetes %>% dplyr::summarise(count(COUNTYFIPS)) %>% dplyr::summarise(sum(freq)) #3234 COUNTYFIPS codes (or 3234 unique counties)
#did not run the second time I ran it
#Still need to understand why the apparent difference in county counts above between Low access and Diabetes datasets - #suggestions welcome#Transform low access data that is provided by *census tract* into a dataframe "Lowaccess_county" where FARA_PCT_LA_HALF_10_POP is summarised as a mean value for each county, omitting NA values.
#Possibly instead of mean, I could use a weighted mean - will still need to work through how to do that properly
Lowaccess_county<- Lowaccess %>% filter(!is.na(FARA_PCT_LA_HALF_10_POP)) %>% dplyr::group_by(COUNTYFIPS) %>% dplyr::summarise(MEAN_PCT_LA_COUNTY = mean(FARA_PCT_LA_HALF_10_POP))
head(Lowaccess_county)## # A tibble: 6 × 2
## COUNTYFIPS MEAN_PCT_LA_COUNTY
## <chr> <dbl>
## 1 01001 63.2
## 2 01003 42.1
## 3 01005 31.1
## 4 01007 1.32
## 5 01009 11.1
## 6 01011 71.4
#Lowaccess_county %>% dplyr::summarise(count(COUNTYFIPS)) %>% dplyr::summarise(sum(freq)) #3140 COUNTYFIPS codes
#did not run the second time I ran this
str(Lowaccess_county) #3140 variables## tibble [3,140 × 2] (S3: tbl_df/tbl/data.frame)
## $ COUNTYFIPS : chr [1:3140] "01001" "01003" "01005" "01007" ...
## $ MEAN_PCT_LA_COUNTY: num [1:3140] 63.25 42.07 31.1 1.32 11.07 ...
str(Lowaccess_county$COUNTYFIPS) #3140 values## chr [1:3140] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
str(Diabetes$COUNTYFIPS) #3234 values ## chr [1:3234] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#Related to question on line 130, why the differences in countyfips counts between Lowaccess_county and Diabetes dataframes?
#Omitting any na values
Lowaccess_county<- na.omit(Lowaccess_county)
Diabetes<- na.omit(Diabetes)
str(Lowaccess_county$COUNTYFIPS) #3140 values, chr## chr [1:3140] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
str(Diabetes$COUNTYFIPS) #3234 values, chr; but another time I ran it there are only 2927 values, #why the difference?## chr [1:2927] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#Still a difference in countyfips counts, why the difference?#Joining Diabetes and Lowaccess_county dataframes
#Notes for thinking about this:
#How many counties are in DM
#How many counties are in Low Access?
#Which counties are different?
#Consider an inner join
#county.x, county.y means that county columns for two different dataframes were joined and both preserved, now distinguished with ".x" or ".y"
#remember that in a join, columns are preserved but row contents may not be
#Which of the following, if either, should I use?
#Innerjoin?
test_innerjoin<- inner_join(Diabetes, Lowaccess_county, by = "COUNTYFIPS")
head(test_innerjoin)## # A tibble: 6 × 26
## COUNTYFIPS STATE…¹ STATE COUNTY REGION ACS_T…² ACS_M…³ ACS_P…⁴ ACS_M…⁵ ACS_P…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alab… Autau… South 55221 37.7 92.4 51281 6.14
## 2 01003 01 Alab… Baldw… South 195121 42.2 92.5 50254 6.36
## 3 01005 01 Alab… Barbo… South 26932 38.8 82.4 32964 13.5
## 4 01007 01 Alab… Bibb … South 22604 38.9 91.7 38678 7.74
## 5 01009 01 Alab… Bloun… South 57710 40.7 92.3 45813 7.77
## 6 01011 01 Alab… Bullo… South 10678 39.3 82.0 31938 18.9
## # … with 16 more variables: ACS_PCT_HH_INC_100000 <dbl>,
## # ACS_PER_CAPITA_INC <dbl>, ACS_PCT_POV_AIAN <dbl>,
## # ACS_MEDIAN_HOME_VALUE <dbl>, ACS_PCT_MEDICAID_ANY <dbl>,
## # ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>, ACS_PCT_MEDICARE_ONLY <dbl>,
## # ACS_PCT_PRIVATE_ANY <dbl>, ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>,
## # ACS_TOT_HH <dbl>, ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>,
## # CHR_PCT_ADULT_OBESITY <dbl>, CHR_PCT_DIABETES <dbl>, …
str(test_innerjoin$COUNTYFIPS) #how many values? #3140## chr [1:2925] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#Leftjoin?
test_leftjoin<- left_join(Diabetes, Lowaccess_county, by = "COUNTYFIPS")
head(test_leftjoin)## # A tibble: 6 × 26
## COUNTYFIPS STATE…¹ STATE COUNTY REGION ACS_T…² ACS_M…³ ACS_P…⁴ ACS_M…⁵ ACS_P…⁶
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 01 Alab… Autau… South 55221 37.7 92.4 51281 6.14
## 2 01003 01 Alab… Baldw… South 195121 42.2 92.5 50254 6.36
## 3 01005 01 Alab… Barbo… South 26932 38.8 82.4 32964 13.5
## 4 01007 01 Alab… Bibb … South 22604 38.9 91.7 38678 7.74
## 5 01009 01 Alab… Bloun… South 57710 40.7 92.3 45813 7.77
## 6 01011 01 Alab… Bullo… South 10678 39.3 82.0 31938 18.9
## # … with 16 more variables: ACS_PCT_HH_INC_100000 <dbl>,
## # ACS_PER_CAPITA_INC <dbl>, ACS_PCT_POV_AIAN <dbl>,
## # ACS_MEDIAN_HOME_VALUE <dbl>, ACS_PCT_MEDICAID_ANY <dbl>,
## # ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>, ACS_PCT_MEDICARE_ONLY <dbl>,
## # ACS_PCT_PRIVATE_ANY <dbl>, ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>,
## # ACS_TOT_HH <dbl>, ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>,
## # CHR_PCT_ADULT_OBESITY <dbl>, CHR_PCT_DIABETES <dbl>, …
str(test_leftjoin$COUNTYFIPS) #how many values? #3234## chr [1:2927] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#As noted above, will likely need to start at the beginning to understand why county counts, countyfips counts vary between these two dataframes
#Checking for any evident differences in output between two approaches, at least in headers of each
#Mean PCT LA by county
head(test_innerjoin$MEAN_PCT_LA_COUNTY)## [1] 63.24917 42.07161 31.10111 1.32000 11.06556 71.36667
head(test_leftjoin$MEAN_PCT_LA_COUNTY)## [1] 63.24917 42.07161 31.10111 1.32000 11.06556 71.36667
#Do not appreciate any differences
#PCT Diabetes
head(test_innerjoin$CHR_PCT_DIABETES)## [1] 14.2 11.3 18.0 14.9 14.3 20.0
head(test_leftjoin$CHR_PCT_DIABETES)## [1] 14.2 11.3 18.0 14.9 14.3 20.0
#Do not appreciate any differences#Loading libraries, county geography data in prep for exploratory data analysis, including choropleths:
library(rgdal)## Warning: package 'rgdal' was built under R version 4.2.2
## Loading required package: sp
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(leaflet)## Warning: package 'leaflet' was built under R version 4.2.2
library(ggplot2)
library(RColorBrewer)
library(plyr)## Warning: package 'plyr' was built under R version 4.2.2
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(rgdal)
library(sf)## Warning: package 'sf' was built under R version 4.2.2
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidycensus)## Warning: package 'tidycensus' was built under R version 4.2.2
library(ggsn)## Warning: package 'ggsn' was built under R version 4.2.2
## Loading required package: grid
library(leaflet)
library(ggplot2)
library(RColorBrewer)
counties<- readRDS(gzcon(url("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds")))
st_crs(counties) <- st_crs(counties)## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
#using base plot function to check that counties contains polygon data:
plot(counties)#using ggplot to check that counties contains polygon data:
ggplot() +
geom_sf(data = counties)#Joining combined data set (test_leftjoin for now, noting I still need to determine if that was the correct join) and counties, but first, need to create a common variable by which to join. Will convert State to a 2 digit number in Counties, then mutate State and County into a 5 digit number, a mutated variable in Counties called "COUNTYFIPS" to correspond to "COUNTYFIPS" in the combined data set, test_leftjoin.
counties$STATE<- str_pad(counties$STATE, width=2, side="left", pad="0")
head(counties)## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS: WGS 84
## GEO_ID STATE COUNTY NAME LSAD CENSUSAREA
## 1 0500000US01001 01 001 Autauga County 594.436
## 2 0500000US01009 01 009 Blount County 644.776
## 3 0500000US01017 01 017 Chambers County 596.531
## 4 0500000US01021 01 021 Chilton County 692.854
## 5 0500000US01033 01 033 Colbert County 592.619
## 6 0500000US01045 01 045 Dale County 561.150
## geometry
## 1 MULTIPOLYGON (((-86.49677 3...
## 2 MULTIPOLYGON (((-86.5778 33...
## 3 MULTIPOLYGON (((-85.18413 3...
## 4 MULTIPOLYGON (((-86.51734 3...
## 5 MULTIPOLYGON (((-88.13999 3...
## 6 MULTIPOLYGON (((-85.41644 3...
counties<- counties %>% dplyr::mutate(COUNTYFIPS = paste(STATE, COUNTY, sep = ""))
head(counties)## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS: WGS 84
## GEO_ID STATE COUNTY NAME LSAD CENSUSAREA
## 1 0500000US01001 01 001 Autauga County 594.436
## 2 0500000US01009 01 009 Blount County 644.776
## 3 0500000US01017 01 017 Chambers County 596.531
## 4 0500000US01021 01 021 Chilton County 692.854
## 5 0500000US01033 01 033 Colbert County 592.619
## 6 0500000US01045 01 045 Dale County 561.150
## geometry COUNTYFIPS
## 1 MULTIPOLYGON (((-86.49677 3... 01001
## 2 MULTIPOLYGON (((-86.5778 33... 01009
## 3 MULTIPOLYGON (((-85.18413 3... 01017
## 4 MULTIPOLYGON (((-86.51734 3... 01021
## 5 MULTIPOLYGON (((-88.13999 3... 01033
## 6 MULTIPOLYGON (((-85.41644 3... 01045
str(counties$COUNTYFIPS) #3109 values## chr [1:3109] "01001" "01009" "01017" "01021" "01033" "01045" "01051" ...
str(test_leftjoin$COUNTYFIPS) #2927 values, #why the difference?## chr [1:2927] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
test_counties_left<- left_join(counties, test_leftjoin, by = 'COUNTYFIPS')
str(test_counties_left) #3109 obs of 33 variables## Classes 'sf' and 'data.frame': 3109 obs. of 33 variables:
## $ GEO_ID : Factor w/ 3221 levels "0500000US01001",..: 1 5 9 11 17 23 26 33 40 42 ...
## $ STATE.x : chr "01" "01" "01" "01" ...
## $ COUNTY.x : Factor w/ 325 levels "001","003","005",..: 1 6 13 16 23 30 34 43 53 55 ...
## $ NAME : Factor w/ 1909 levels "Abbeville","Acadia",..: 89 174 309 341 387 460 567 730 986 1009 ...
## $ LSAD : Factor w/ 8 levels "Borough","CA",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ CENSUSAREA : num 594 645 597 693 593 ...
## $ COUNTYFIPS : chr "01001" "01009" "01017" "01021" ...
## $ STATEFIPS : chr "01" "01" "01" "01" ...
## $ STATE.y : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ COUNTY.y : chr "Autauga County" "Blount County" "Chambers County" "Chilton County" ...
## $ REGION : chr "South" "South" "South" "South" ...
## $ ACS_TOT_POP_WT : num 55221 57710 34079 43819 54444 ...
## $ ACS_MEDIAN_AGE : num 37.7 40.7 43 38.8 42.2 36.8 37.9 40.2 41.7 39 ...
## $ ACS_PCT_EMPLOYED : num 92.4 92.3 91.1 90.8 91 ...
## $ ACS_MEDIAN_HH_INC : num 51281 45813 34177 41627 40576 ...
## $ ACS_PCT_HH_INC_10000 : num 6.14 7.77 10.63 9.22 10.35 ...
## $ ACS_PCT_HH_INC_100000 : num 19.3 12.88 8.52 12.55 11.68 ...
## $ ACS_PER_CAPITA_INC : num 24974 20532 21071 21399 22546 ...
## $ ACS_PCT_POV_AIAN : num 52.2 6.6 36.84 6.33 18.24 ...
## $ ACS_MEDIAN_HOME_VALUE : num 141300 119800 80800 100100 99400 ...
## $ ACS_PCT_MEDICAID_ANY : num 12.3 15.4 19.9 16 15.1 ...
## $ ACS_PCT_MEDICAID_ANY_BELOW64: num 13.2 16.8 21.3 17.6 16.7 ...
## $ ACS_PCT_MEDICARE_ONLY : num 3.86 5.82 7.51 6.28 5.76 3.9 3.54 5.69 5.23 4.2 ...
## $ ACS_PCT_PRIVATE_ANY : num 59.2 58.4 50.6 53.8 58.6 ...
## $ ACS_PCT_UNINSURED : num 10.1 11.6 13.9 16 11.1 ...
## $ CEN_AIAN_NH_IND : chr "0" "0" "0" "0" ...
## $ ACS_TOT_HH : num 20396 20816 13787 16237 22204 ...
## $ ACS_AVG_HH_SIZE : num 2.68 2.74 2.44 2.68 2.43 2.56 2.65 2.53 2.46 2.64 ...
## $ ACS_PCT_CHILD_1FAM : num 22.3 25.2 50.4 26.7 31.1 ...
## $ CHR_PCT_ADULT_OBESITY : num 37.5 34.4 40.3 35.3 32.5 37.3 36.1 42 35 33.6 ...
## $ CHR_PCT_DIABETES : num 14.2 14.3 17 14.7 17.6 15.6 14.6 16.1 14.9 12.4 ...
## $ MEAN_PCT_LA_COUNTY : num 63.2 11.1 36 11.2 54.3 ...
## $ geometry :sfc_MULTIPOLYGON of length 3109; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:10, 1:2] -86.5 -86.7 -86.8 -86.9 -86.9 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:32] "GEO_ID" "STATE.x" "COUNTY.x" "NAME" ...
test_counties_inner<- inner_join
str(test_counties_inner) #2893 obs of 33 variables## function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ..., keep = FALSE)
#Exploratory data analysis - sorting data, top/bottom rates and corresponding counties
#Starting with low access...
colnames(test_counties_left)## [1] "GEO_ID" "STATE.x"
## [3] "COUNTY.x" "NAME"
## [5] "LSAD" "CENSUSAREA"
## [7] "COUNTYFIPS" "STATEFIPS"
## [9] "STATE.y" "COUNTY.y"
## [11] "REGION" "ACS_TOT_POP_WT"
## [13] "ACS_MEDIAN_AGE" "ACS_PCT_EMPLOYED"
## [15] "ACS_MEDIAN_HH_INC" "ACS_PCT_HH_INC_10000"
## [17] "ACS_PCT_HH_INC_100000" "ACS_PER_CAPITA_INC"
## [19] "ACS_PCT_POV_AIAN" "ACS_MEDIAN_HOME_VALUE"
## [21] "ACS_PCT_MEDICAID_ANY" "ACS_PCT_MEDICAID_ANY_BELOW64"
## [23] "ACS_PCT_MEDICARE_ONLY" "ACS_PCT_PRIVATE_ANY"
## [25] "ACS_PCT_UNINSURED" "CEN_AIAN_NH_IND"
## [27] "ACS_TOT_HH" "ACS_AVG_HH_SIZE"
## [29] "ACS_PCT_CHILD_1FAM" "CHR_PCT_ADULT_OBESITY"
## [31] "CHR_PCT_DIABETES" "MEAN_PCT_LA_COUNTY"
## [33] "geometry"
#Inspecting a trimmed-down dataframe
selection<- test_counties_left %>% dplyr::select(COUNTY.y, STATE.y, MEAN_PCT_LA_COUNTY)
View(selection)#Mean low access percentage
mean_LA<- test_counties_left %>% dplyr::summarise(meanLA = mean(MEAN_PCT_LA_COUNTY, na.rm = TRUE))
mean_LA #mean low access was 38.21%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## meanLA geometry
## 1 38.21385 MULTIPOLYGON (((-96.83003 2...
#Median low access percentage
median_LA<- test_counties_left %>% dplyr::summarise(medianLA = median(MEAN_PCT_LA_COUNTY, na.rm = TRUE))
median_LA #median low access was 37.18%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## medianLA geometry
## 1 37.1827 MULTIPOLYGON (((-96.83003 2...
#How many counties have 100% of population with low access?
y<-test_counties_left$MEAN_PCT_LA_COUNTY
counttoplowaccess<-length(which(y == 100))
counttoplowaccess #37 counties have 100% of the population with low access to food## [1] 37
#Lowest access counties
Lowest_access_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% slice_max(MEAN_PCT_LA_COUNTY)
Lowest_access_counties## Simple feature collection with 37 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -120.4952 ymin: 29.77719 xmax: -79.31228 ymax: 48.99902
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION STATE.y COUNTY.y MEAN_PCT_LA_COUNTY
## 1 20075 Midwest Kansas Hamilton County 100
## 2 20203 Midwest Kansas Wichita County 100
## 3 30059 West Montana Meagher County 100
## 4 30107 West Montana Wheatland County 100
## 5 40043 South Oklahoma Dewey County 100
## 6 38007 Midwest North Dakota Billings County 100
## 7 41069 West Oregon Wheeler County 100
## 8 48359 South Texas Oldham County 100
## 9 08061 West Colorado Kiowa County 100
## 10 16025 West Idaho Camas County 100
## geometry
## 1 MULTIPOLYGON (((-101.5675 3...
## 2 MULTIPOLYGON (((-101.5671 3...
## 3 MULTIPOLYGON (((-110.2819 4...
## 4 MULTIPOLYGON (((-109.6543 4...
## 5 MULTIPOLYGON (((-99.38102 3...
## 6 MULTIPOLYGON (((-103.6092 4...
## 7 MULTIPOLYGON (((-120.3714 4...
## 8 MULTIPOLYGON (((-103.0424 3...
## 9 MULTIPOLYGON (((-102.0452 3...
## 10 MULTIPOLYGON (((-114.3749 4...
View(Lowest_access_counties)#In what counties do more than 75% of the population have low food access?
yseventyfive<- test_counties_left$MEAN_PCT_LA_COUNTY
countseventyfive<- length(which(yseventyfive > 75))
countseventyfive #In 126 counties greater than 75% of the population has low food access## [1] 126
Topquartile<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% filter(MEAN_PCT_LA_COUNTY > 75)
Topquartile## Simple feature collection with 126 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.7174 xmax: -71.2248 ymax: 49.00069
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION STATE.y COUNTY.y MEAN_PCT_LA_COUNTY
## 1 08014 West Colorado Broomfield County 81.58400
## 2 13059 South Georgia Clarke County 75.34867
## 3 08111 West Colorado San Juan County 99.24000
## 4 20049 Midwest Kansas Elk County 99.36000
## 5 20075 Midwest Kansas Hamilton County 100.00000
## 6 20203 Midwest Kansas Wichita County 100.00000
## 7 31163 Midwest Nebraska Sherman County 83.43000
## 8 31171 Midwest Nebraska Thomas County 99.94000
## 9 32009 West Nevada Esmeralda County 97.19000
## 10 32021 West Nevada Mineral County 86.83500
## geometry
## 1 MULTIPOLYGON (((-105.1473 3...
## 2 MULTIPOLYGON (((-83.36003 3...
## 3 MULTIPOLYGON (((-107.7383 3...
## 4 MULTIPOLYGON (((-96.52569 3...
## 5 MULTIPOLYGON (((-101.5675 3...
## 6 MULTIPOLYGON (((-101.5671 3...
## 7 MULTIPOLYGON (((-98.74853 4...
## 8 MULTIPOLYGON (((-100.8461 4...
## 9 MULTIPOLYGON (((-117.691 38...
## 10 MULTIPOLYGON (((-117.691 38...
View(Topquartile)#What regions have the lowest food access?
Topregions<- Topquartile %>% dplyr::summarise(count(REGION))
Topregions## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.7174 xmax: -71.2248 ymax: 49.00069
## Geodetic CRS: WGS 84
## x freq geometry
## 1 Midwest 48 MULTIPOLYGON (((-96.81783 3...
## 2 Northeast 1 MULTIPOLYGON (((-96.81783 3...
## 3 South 38 MULTIPOLYGON (((-96.81783 3...
## 4 West 39 MULTIPOLYGON (((-96.81783 3...
Topregions_summary<- dplyr::arrange(Topregions, desc(freq))
Topregions_summary## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -123.7279 ymin: 26.7174 xmax: -71.2248 ymax: 49.00069
## Geodetic CRS: WGS 84
## x freq geometry
## 1 Midwest 48 MULTIPOLYGON (((-96.81783 3...
## 2 West 39 MULTIPOLYGON (((-96.81783 3...
## 3 South 38 MULTIPOLYGON (((-96.81783 3...
## 4 Northeast 1 MULTIPOLYGON (((-96.81783 3...
#Ranked by number of counties with >75% of population with low food access:Midwest with 48 counties; West with 39 counties, South with 38 counties, and Northeast with 1 county, for 126 counties total #How many counties have 0% of population with low access (that is, they are high food access)?
y1<-test_counties_left$MEAN_PCT_LA_COUNTY
countleastlowaccess<-length(which(y1 == 0))
countleastlowaccess #29 counties have 0% of the population with low access to food## [1] 29
#Highest access counties (0% of population is low access)
Highest_access_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% slice_min(MEAN_PCT_LA_COUNTY)
Highest_access_counties## Simple feature collection with 29 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -105.6903 ymin: 31.43369 xmax: -76.24528 ymax: 45.21074
## Geodetic CRS: WGS 84
## First 10 features:
## COUNTYFIPS REGION STATE.y COUNTY.y MEAN_PCT_LA_COUNTY
## 1 08047 West Colorado Gilpin County 0
## 2 17035 Midwest Illinois Cumberland County 0
## 3 37103 South North Carolina Jones County 0
## 4 47057 South Tennessee Grainger County 0
## 5 48379 South Texas Rains County 0
## 6 48425 South Texas Somervell County 0
## 7 13119 South Georgia Franklin County 0
## 8 13147 South Georgia Hart County 0
## 9 28031 South Mississippi Covington County 0
## 10 26019 Midwest Michigan Benzie County 0
## geometry
## 1 MULTIPOLYGON (((-105.3978 3...
## 2 MULTIPOLYGON (((-88.01212 3...
## 3 MULTIPOLYGON (((-77.47372 3...
## 4 MULTIPOLYGON (((-83.66746 3...
## 5 MULTIPOLYGON (((-95.66539 3...
## 6 MULTIPOLYGON (((-97.86486 3...
## 7 MULTIPOLYGON (((-83.35527 3...
## 8 MULTIPOLYGON (((-82.99222 3...
## 9 MULTIPOLYGON (((-89.39918 3...
## 10 MULTIPOLYGON (((-86.2335 44...
View(Highest_access_counties)
#Need to check this...some likely rural South counties where I am surprised that the county would be 0% low access#How many counties are missing values for low access?
y2<-test_counties_left$MEAN_PCT_LA_COUNTY
countmissinglowaccess<-length(which(is.na(y2)))
countmissinglowaccess #216 counties are missing values for low access## [1] 216
#Missing access counties (by COUNTYFIPS)
Missing_access_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% dplyr::filter(is.na(MEAN_PCT_LA_COUNTY))
View(Missing_access_counties)
#Why are county names missing? #How can I get county names (back)?#Continuing exploratory data analysis with diabetes...
#Inspecting a trimmed-down dataframe
selection_dm<- test_counties_left %>% dplyr::select(COUNTY.y, STATE.y, CHR_PCT_DIABETES)
View(selection_dm)#Mean percentage of population with diabetes
mean_DM<- test_counties_left %>% dplyr::summarise(meanDM = mean(CHR_PCT_DIABETES, na.rm = TRUE))
mean_DM #Mean percentage of population with diabetes was 11.57%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## meanDM geometry
## 1 11.57276 MULTIPOLYGON (((-96.83003 2...
#Median percentage of population with diabetes
median_DM<- test_counties_left %>% dplyr::summarise(medianDM = median(CHR_PCT_DIABETES, na.rm = TRUE))
median_DM #Median perentage of population with diabetes was 11.4%## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: WGS 84
## medianDM geometry
## 1 11.4 MULTIPOLYGON (((-96.83003 2...
#What were the top 100 U.S. counties in terms of greatest percentage of population with diabetes in 2015?
Diabetes_high<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, CHR_PCT_DIABETES) %>% slice_max(CHR_PCT_DIABETES, n = 100)
View(Diabetes_high)#What regions had the greatest number of counties in the top 100 in terms of percentage of population with diabetes in 2015?
Topregions_DM<- Diabetes_high %>% dplyr::summarize(count(REGION) %>% dplyr::arrange(desc(freq)))
View(Topregions_DM)
#South and Midwest. #95 of the top 100 counties in terms of highest percentages of diabetes are in the South. 5 are in the Midwest.#What are the 100 counties with the lowest percentages of their populations with diabetes?
Diabetes_low<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, CHR_PCT_DIABETES) %>% slice_min(CHR_PCT_DIABETES, n = 100)
View(Diabetes_low) #What regions are represented by the bottom 100 counties, those counties that have the lowest percentage of population with diabetes?
Bottomregions_DM<- Diabetes_low %>% dplyr::summarize(count(REGION) %>% dplyr::arrange(desc(freq)))
View(Bottomregions_DM)
#West (64 counties), Midwest (25 counties), Northeast (9 counties), South (8 counties)#How many counties are missing values for diabetes prevalence?
y_dm<-test_counties_left$CHR_PCT_DIABETES
countmissingdm<-length(which(is.na(y_dm)))
countmissingdm #216 counties are missing values for diabetes prevalence## [1] 216
#Missing access counties (by COUNTYFIPS)
Missing_dm_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, CHR_PCT_DIABETES) %>% dplyr::filter(is.na(CHR_PCT_DIABETES))
View(Missing_dm_counties)
#Why are county names not showing? Need to figure this out.#Exploratory data analysis - potential choropleths and/or a leaflet
library(RColorBrewer)
myPalette <- colorRampPalette(brewer.pal(9, "BuPu"))
my_theme <- function() {
theme_minimal() +
theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
panel.grid = element_line(color = "white"),
legend.key.size = unit(0.5, "cm"),
legend.text = element_text(size = 10),
legend.title = element_text(size = 12),
plot.title = element_text(size = 16))
}
myPalette <- colorRampPalette(brewer.pal(9, "YlOrRd"))
#colors
library(leaflet)
# Select a color palette with which to run the palette function
pal_fun <- colorNumeric("BuPu", NULL) # Blue-Purple from RColorBrewer
pal_fun2 <- colorNumeric("YlOrRd", NULL) # Yellow-Orange-Red from RColorBrewer
pal_fun3 <- colorNumeric("viridis", NULL) # viridis from viridis
pal_fun4 <- colorNumeric("inferno", NULL) # inferno from viridis
pal_fun5 <- colorNumeric("inferno", NULL, reverse=TRUE) # reverses the color ramp
#The following are some options - welcome any suggestions for making them prettier, less clunky, noting scalebar overlaying the choropleth in at least once case!
#Low access plot, using left join, pink/white
ggplot() +
geom_sf(data = test_counties_left, aes(fill = MEAN_PCT_LA_COUNTY), lwd = 0) +
my_theme() +
ggtitle("Percent of population with low access to food, by U.S. county, 2015, left, pink scheme") +
scale_fill_continuous(low = "antiquewhite2", high = "red", guide = "colorbar") +
north(x.min = -75.28031, y.min = 39.86747, # add north bar with north()
x.max = -74.95575, y.max = 40.13793, # set map boundaries with x.min, y.min, etc..
symbol=12, # select north arrow symbol
anchor = c(x = -75, y = 39.93)) + # set north bar location
scalebar(x.min = -75.28031, y.min = 39.86747, # add scalebar with scalebar()
x.max = -74.95575, y.max = 40.13793, # set map boundaries
dist = 5, dist_unit = 'km', # set scalebar segment length = 5km
transform = TRUE, # TRUE for decimal degree coordinates
model = "WGS84") # specify CR#OR....
#Low access plot, using left join, blue:
ggplot() +
geom_sf(data = test_counties_left, aes(fill =
MEAN_PCT_LA_COUNTY), lwd = 0) +
my_theme() +
ggtitle("Percent of population with low access to food, by U.S. county, 2015, left, blue scheme") +
theme(legend.position = "topright")#Starter choropleth for diabetes....
#Diabetes plot, using left join, blue:
ggplot() +
geom_sf(data = test_counties_left, aes(fill = CHR_PCT_DIABETES), lwd = 0) +
my_theme() +
ggtitle("Percent of population with diabetes, by U.S. county, 2015, left, blue theme")+
theme(legend.position = "none")#OR...
#Alternative to create an interactive diabetes map:
#Popup message
pu_message <- paste0(test_counties_left$COUNTY.y, ", ", test_counties_left$STATE.y, "<br>Diabetes prevalence 2015: ", round(test_counties_left$CHR_PCT_DIABETES, 1), "%")
#Interactive map with legend and scalebar
leaflet(test_counties_left) %>%
addPolygons(fillColor = ~pal_fun4(CHR_PCT_DIABETES),
popup = pu_message) %>%
addTiles() %>% addLegend("bottomright", pal=pal_fun4, values = ~CHR_PCT_DIABETES,
title = '2015 Diabetes prevalence rates',
opacity = 1) %>% addScaleBar() ###Scatterplot, correlation
#Which, if any, of the following are worth keeping? Suggestions welcome.
library(modelsummary)##
## Attaching package: 'modelsummary'
## The following object is masked from 'package:tidycensus':
##
## get_estimates
#Scatterplot
ggplot(test_counties_left, aes(x = MEAN_PCT_LA_COUNTY, y = CHR_PCT_DIABETES)) +
geom_point(color = "darkseagreen3") ## Warning: Removed 216 rows containing missing values (geom_point).
#Correlation
ggplot(data = test_counties_left, aes(x = MEAN_PCT_LA_COUNTY, y = CHR_PCT_DIABETES)) +
geom_point(color = "darkseagreen3") + ggtitle("Scatterplot: Degree of low access vs. Diabetes prevalence")## Warning: Removed 216 rows containing missing values (geom_point).
#Pearson
cor(test_counties_left$MEAN_PCT_LA_COUNTY, test_counties_left$CHR_PCT_DIABETES, use = "complete.obs") ## [1] -0.2911943
#Spearman
cor(test_counties_left$MEAN_PCT_LA_COUNTY, test_counties_left$CHR_PCT_DIABETES, use = "complete.obs", method = "spearman") #Rank-based correlation## [1] -0.3106256
cor(test_counties_left$MEAN_PCT_LA_COUNTY, test_counties_left$CHR_PCT_DIABETES, use = "complete.obs") ## [1] -0.2911943
#Logistic model
LA_DM_lm_fit <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
LA_DM_lm_fit##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
##
## Coefficients:
## (Intercept) CHR_PCT_DIABETES
## 67.96 -2.57
summary_LA_DM_lm_fit <- summary.lm(LA_DM_lm_fit) #Summary of results
summary_LA_DM_lm_fit##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.510 -16.022 -0.798 14.647 71.366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9600 1.8608 36.52 <2e-16 ***
## CHR_PCT_DIABETES -2.5704 0.1571 -16.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.44 on 2891 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.08479, Adjusted R-squared: 0.08448
## F-statistic: 267.9 on 1 and 2891 DF, p-value: < 2.2e-16
names(summary_LA_DM_lm_fit) #We can retrieve output from summary statistics for model fit## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled" "na.action"
summary_LA_DM_lm_fit$adj.r.squared## [1] 0.08447756
coef(LA_DM_lm_fit) #We recover the intercept and coefficient for x that are expected## (Intercept) CHR_PCT_DIABETES
## 67.959951 -2.570355
confint(LA_DM_lm_fit) #Confidence intervals for fit## 2.5 % 97.5 %
## (Intercept) 64.311420 71.608483
## CHR_PCT_DIABETES -2.878302 -2.262408
ggplot(test_counties_left, aes(x = MEAN_PCT_LA_COUNTY, y = CHR_PCT_DIABETES)) +
geom_point(color = "darkseagreen3") +
geom_smooth(method = "lm", color = "green4")## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 216 rows containing non-finite values (stat_smooth).
## Warning: Removed 216 rows containing missing values (geom_point).
modelsummary(LA_DM_lm_fit, statistic = "p.value") #Nicer table of output| Model 1 | |
|---|---|
| (Intercept) | 67.960 |
| (0.000) | |
| CHR_PCT_DIABETES | −2.570 |
| (0.000) | |
| Num.Obs. | 2893 |
| R2 | 0.085 |
| R2 Adj. | 0.084 |
| AIC | 25950.8 |
| BIC | 25968.7 |
| Log.Lik. | −12972.415 |
| RMSE | 21.44 |
summary(LA_DM_lm_fit)##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.510 -16.022 -0.798 14.647 71.366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.9600 1.8608 36.52 <2e-16 ***
## CHR_PCT_DIABETES -2.5704 0.1571 -16.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.44 on 2891 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.08479, Adjusted R-squared: 0.08448
## F-statistic: 267.9 on 1 and 2891 DF, p-value: < 2.2e-16
#How best to interpret these findings? As food access becomes more limited in a county, prevalence of diabetes decreases?#Assessing other variables that may have a relationship to diabetes prevalence, beyond low access to food...
#Median household income
ggplot(test_counties_left, aes(x = ACS_MEDIAN_HH_INC, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Median household income relationship to diabetes prevalence")## Warning: Removed 216 rows containing missing values (geom_point).
summary((lm(CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_left)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3549 -1.3533 -0.0076 1.3629 7.1082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.713e+01 1.582e-01 108.30 <2e-16 ***
## ACS_MEDIAN_HH_INC -1.180e-04 3.254e-06 -36.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.105 on 2891 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.3127, Adjusted R-squared: 0.3125
## F-statistic: 1315 on 1 and 2891 DF, p-value: < 2.2e-16
#Per capita income
ggplot(test_counties_left, aes(x = ACS_PER_CAPITA_INC, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Per capita income relationship to diabetes prevalence")## Warning: Removed 216 rows containing missing values (geom_point).
summary((lm(CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_left)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.748 -1.393 0.043 1.371 7.391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.742e+01 1.754e-01 99.30 <2e-16 ***
## ACS_PER_CAPITA_INC -2.390e-04 6.987e-06 -34.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.143 on 2891 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.2881, Adjusted R-squared: 0.2879
## F-statistic: 1170 on 1 and 2891 DF, p-value: < 2.2e-16
#Obesity
ggplot(test_counties_left, aes(x = CHR_PCT_ADULT_OBESITY, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") +
ggtitle("Obesity rate relationship to diabetes prevalence") ## Warning: Removed 216 rows containing missing values (geom_point).
summary((lm(CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_left)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0388 -1.2559 -0.1007 1.1966 7.1952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.350080 0.249183 -1.405 0.16
## CHR_PCT_ADULT_OBESITY 0.372427 0.007706 48.330 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.889 on 2891 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.4469, Adjusted R-squared: 0.4467
## F-statistic: 2336 on 1 and 2891 DF, p-value: < 2.2e-16
#Uninsured status
ggplot(test_counties_left, aes(x = ACS_PCT_UNINSURED, y = CHR_PCT_DIABETES)) +
geom_point(color = "sienna4") + ggtitle("Uninsured status relationship to diabetes prevalence")## Warning: Removed 216 rows containing missing values (geom_point).
summary((lm(CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_left)))##
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9412 -1.6379 -0.0419 1.5832 8.8885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.963767 0.124819 79.83 <2e-16 ***
## ACS_PCT_UNINSURED 0.122181 0.008819 13.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 2891 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.06225, Adjusted R-squared: 0.06193
## F-statistic: 191.9 on 1 and 2891 DF, p-value: < 2.2e-16
#Combined linear models
#Including median income
lm_medincome <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC, data = test_counties_left)
summary(lm_medincome)##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC,
## data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.409 -15.728 -0.633 14.607 70.812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.360e+01 3.611e+00 14.843 < 2e-16 ***
## CHR_PCT_DIABETES -2.081e+00 1.888e-01 -11.023 < 2e-16 ***
## ACS_MEDIAN_HH_INC 1.847e-04 3.985e-05 4.636 3.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.37 on 2890 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.09155, Adjusted R-squared: 0.09092
## F-statistic: 145.6 on 2 and 2890 DF, p-value: < 2.2e-16
#Including per capita income
lm_percapincome <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PER_CAPITA_INC, data = test_counties_left)
summary(lm_percapincome)##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PER_CAPITA_INC,
## data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.404 -15.747 -0.585 14.607 72.604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.274e+01 3.672e+00 14.361 < 2e-16 ***
## CHR_PCT_DIABETES -2.093e+00 1.854e-01 -11.284 < 2e-16 ***
## ACS_PER_CAPITA_INC 3.964e-04 8.256e-05 4.801 1.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.36 on 2890 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.09203, Adjusted R-squared: 0.09141
## F-statistic: 146.5 on 2 and 2890 DF, p-value: < 2.2e-16
#Including obesity
lm_obesity <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + CHR_PCT_ADULT_OBESITY, data = test_counties_left)
summary(lm_obesity)##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + CHR_PCT_ADULT_OBESITY,
## data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.785 -15.979 -0.739 14.674 71.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.0454 2.8299 22.985 <2e-16 ***
## CHR_PCT_DIABETES -2.7633 0.2111 -13.087 <2e-16 ***
## CHR_PCT_ADULT_OBESITY 0.1608 0.1176 1.367 0.172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.44 on 2890 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.08539, Adjusted R-squared: 0.08475
## F-statistic: 134.9 on 2 and 2890 DF, p-value: < 2.2e-16
#Including uninsured status
lm_uninsured <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PCT_UNINSURED, data = test_counties_left)
summary(lm_uninsured)##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PCT_UNINSURED,
## data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.72 -15.84 -0.62 14.49 72.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.31668 1.94594 34.080 < 2e-16 ***
## CHR_PCT_DIABETES -2.68548 0.16198 -16.579 < 2e-16 ***
## ACS_PCT_UNINSURED 0.22596 0.07932 2.849 0.00442 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.42 on 2890 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.08736, Adjusted R-squared: 0.08673
## F-statistic: 138.3 on 2 and 2890 DF, p-value: < 2.2e-16
#How about a model that accounts for all of the above variables?
#Trying...
lm_combinedvars<- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC + ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED, data = test_counties_left)
modelsummary(lm_combinedvars, statistic = "p.value")| Model 1 | |
|---|---|
| (Intercept) | 26.533 |
| (0.000) | |
| CHR_PCT_DIABETES | −2.526 |
| (0.000) | |
| ACS_MEDIAN_HH_INC | 0.000 |
| (0.425) | |
| ACS_PER_CAPITA_INC | 0.001 |
| (0.000) | |
| CHR_PCT_ADULT_OBESITY | 0.481 |
| (0.000) | |
| ACS_PCT_UNINSURED | 0.565 |
| (0.000) | |
| Num.Obs. | 2893 |
| R2 | 0.107 |
| R2 Adj. | 0.105 |
| AIC | 25889.2 |
| BIC | 25931.0 |
| Log.Lik. | −12937.617 |
| RMSE | 21.18 |
lm_combinedvars##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC +
## ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED,
## data = test_counties_left)
##
## Coefficients:
## (Intercept) CHR_PCT_DIABETES ACS_MEDIAN_HH_INC
## 2.653e+01 -2.526e+00 5.599e-05
## ACS_PER_CAPITA_INC CHR_PCT_ADULT_OBESITY ACS_PCT_UNINSURED
## 6.314e-04 4.812e-01 5.646e-01
summary(lm_combinedvars)##
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC +
## ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED,
## data = test_counties_left)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.147 -15.437 -0.443 14.237 72.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.653e+01 5.520e+00 4.807 1.61e-06 ***
## CHR_PCT_DIABETES -2.526e+00 2.300e-01 -10.982 < 2e-16 ***
## ACS_MEDIAN_HH_INC 5.599e-05 7.014e-05 0.798 0.424811
## ACS_PER_CAPITA_INC 6.314e-04 1.555e-04 4.062 5.00e-05 ***
## CHR_PCT_ADULT_OBESITY 4.812e-01 1.249e-01 3.853 0.000119 ***
## ACS_PCT_UNINSURED 5.646e-01 8.965e-02 6.297 3.49e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.2 on 2887 degrees of freedom
## (216 observations deleted due to missingness)
## Multiple R-squared: 0.1065, Adjusted R-squared: 0.105
## F-statistic: 68.86 on 5 and 2887 DF, p-value: < 2.2e-16
#How to properly interpret these findings? #How to state conclusions appropriately?#The following are code-writing reference materials (from practicums,
homeworks), not part of final project…
############################################################## #From
reference, taken from practicum 5 - does this material help at all to
interpret the impact of other variables such as median income, obesity
etc.? # We might try a combined model to see whether the change in fev1
associated with fvc changes by sex, or whether there is an interaction
between fvc and sex. Recall from lecture that the + sex
term allows for a potential change in intercept in the line that models
fev1 ~ fvc for a given sex, while the
+ fvc:sex term allows for a potential change in slope that
models the fev1 ~ fvc relationship for a given sex.
lm1 <- lm(fev1 ~ fvc + sex, data = nhanes) lm2 <- lm(fev1 ~ fvc + sex + fvc:sex, data = nhanes) modelsummary(list(lm1, lm2), coef_omit = “Intercept”, statistic = c(“p = {p.value}”), estimate = “{estimate} [{conf.low}, {conf.high}]”) ################################################################
##############################################################
#Extra content from homework #2 for reference, possible use:
#* How many subjects had wheezing or whistling in their chest in the past year?
{r, eval = TRUE}
x<-nhanesdata$RDQ070
count1<- length(which(x==1))
count1
#* How many adults are part of the study (with adults defined as age of 18 years or greater)?
{r, eval = TRUE}
y<-nhanesdata$RIDAGEYR
countadult<- length(which(y>= 18))
countadult
#From homework #3
# + How many flights in the dataset had destination of Miami International Airport (MIA)? Which airport had the most departing flights to MIA?
{r, eval=TRUE}
library(nycflights13)
table(flights$dest == "MIA")
table(flights$dest == 'MIA', flights$origin)
{r, eval = TRUE}
library(dplyr) listone<- dplyr::select(flights, -year, - month, -day,
-dep_time, -sched_dep_time, -dep_delay, -arr_time, -sched_arr_time,
-arr_delay, -flight, -dest, -distance, -hour, -minute) listtwo<-
dplyr::filter(listone, origin == ‘JFK’ | origin == ‘LGA’, air_time
>120) %>% group_by(tailnum, carrier) listthree<-
dplyr::filter(listtwo, (grepl(‘JB’, tailnum))) %>% group_by(carrier)
listthree %>% count(carrier)
# + List the top 10 destinations based on having the most flights to each, along with the number of flights to each destination. Next, list the destination with the most flights per month, along with the corresponding number of flights for each month.
{r, eval=TRUE}
flights %>% count(dest)
destcount<- flights %>% count(dest)
head(destcount)
dplyr::arrange(destcount, desc(n))
flfreq<- flights %>% dplyr::group_by(month, dest) %>% dplyr::summarise(destfreq= n())
flfreq
arrange(flfreq, desc(destfreq))
top_1<- flfreq %>% dplyr::slice_max(destfreq, n = 1)
top_1
+ Was the mean arrival delay (arr_delay) per carrier related to the total number of flights per carrier? The answer should show a plot and use one sentence to address this question qualitatively. That is, there is no need to create a model to formally evaluate a relationship.
{r, eval = TRUE} library(ggplot2) mean_arr_delay<- flights %>% dplyr::group_by(carrier) %>% summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) mean_arr_delay
tot_fl_carrier<- flights %>% dplyr::select(carrier, flight) %>% group_by(carrier) %>% summarise(totflights = n()) tot_fl_carrier
fulljoin <- dplyr::full_join(mean_arr_delay, tot_fl_carrier) fulljoin
p <- ggplot(fulljoin, aes(x = totflights, y = mean_delay)) + geom_point() + geom_smooth(method= “lm”, color = “red”) p ########################################################### END OF DRAFT ###